Background

Climate change is one of the biggest challenges facing humanity today, resulting, among other effects, in more frequent and severe weather events. The Federal Emergency Management Agency (FEMA) is part of the U.S. Department of Homeland Security and is responsible for coordinating the response to disasters and emergencies, both natural and man-made. FEMA supports state and local governments, as well as individuals affected by disasters, by offering financial assistance, providing temporary housing, and assisting with recovery efforts. The agency also plays a key role in disaster preparedness and risk reduction through programs that help communities develop plans and take steps to reduce the impact of future disasters.

Description of data set(s)

FEMA provides various data sets on past disasters, some going back to the 1960s. For our project we merged two data sets from FEMA and enriched them with additional information such as geographical data (longitude and latitude) and the population size per state from Wikipedia. The first FEMA data set contains, for example, the type of disaster (fire, flooding, etc.), the state in which the disaster occurred, and the dates when it occurred. From the second data set, we pulled the amount of damage each disaster caused, measured in money. Since the second data set only goes back to 2002, we restricted our analysis to the time frame of the past 20 years.
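A minimal sketch of how such a merge could look, using simplified stand-ins for the four sources (all object and column names here are illustrative assumptions, not the actual objects used in the project):

```r
library(dplyr)

# Hypothetical, simplified stand-ins for the four sources
fema.declarations <- data.frame(DisasterNo = c(1971, 4655),
                                Type = c("Severe Storm", "Flood"),
                                State = c("Alabama", "Montana"))
fema.damage <- data.frame(DisasterNo = c(1971, 4655),
                          Damage = c(468614.48, 2901.34))
geo.wiki <- data.frame(State = c("Alabama", "Montana"),
                       Latitude = c(34.3670, 45.4884),
                       Longitude = c(-86.3066, -110.5267))
pop.wiki <- data.frame(State = c("Alabama", "Montana"),
                       Population = c(96137, 16513))

df <- fema.declarations %>%
  inner_join(fema.damage, by = "DisasterNo") %>%  # combine the two FEMA sets
  left_join(geo.wiki, by = "State") %>%           # enrich with coordinates
  left_join(pop.wiki, by = "State")               # enrich with population
```

In the real project the keys and granularity (county vs. state) differ; this only illustrates the join logic.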

After merging and cleaning the data from the four sources, our final data set for the analysis looked like this:

| Disaster# | Type | County | State | Latitude | Longitude | Damage | Approved | Repair | Population | IncomeCapita | DisasterBegin | DisasterEnd | Duration |
|-----------|------|--------|-------|----------|-----------|--------|----------|--------|------------|--------------|---------------|-------------|----------|
| 1971 | Severe Storm | Marshall | Alabama | 34.3670 | -86.3066 | 468614.48 | 116003.84 | 101089.45 | 96137 | 20382 | 2011-04-15 | 2011-05-31 | 46 days |
| 4655 | Flood | Park | Montana | 45.4884 | -110.5267 | 2901.34 | 12532.99 | 11078.99 | 16513 | 24611 | 2022-06-10 | 2022-07-05 | 25 days |
| 4637 | Tornado | Cheatham | Tennessee | 36.2612 | -87.0868 | 7753.23 | 14653.23 | 6344.23 | 40539 | 23459 | 2021-12-10 | 2021-12-11 | 1 days |
| 4633 | Tornado | Jackson | Arkansas | 35.5992 | -91.2145 | 199.50 | 199.50 | 0.00 | 16908 | 15609 | 2021-12-10 | 2021-12-11 | 1 days |
| 4630 | Tornado | Lyon | Kentucky | 37.0191 | -88.0833 | 0.00 | 0.00 | 0.00 | 8226 | 22123 | 2021-12-10 | 2021-12-11 | 1 days |
| 4618 | Hurricane | Bucks | Pennsylvania | 40.3370 | -75.1069 | 92390.42 | 52031.76 | 42999.73 | 627668 | 37466 | 2021-08-31 | 2021-09-05 | 5 days |

Exploratory Data Analysis

Relative frequency of disaster types

Displaying the cumulative frequencies of each disaster type relative to each other, the stacked bar chart shows that severe storms are by far the most common disaster type, followed by hurricanes and biological disasters.
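The chunk for this chart is not shown in the report; a rough sketch of such a stacked bar of relative frequencies could look like this (toy counts stand in for the real data):

```r
library(dplyr)
library(ggplot2)

# Toy disaster records standing in for the project's data
toy <- data.frame(Type = c(rep("Severe Storm", 6), rep("Hurricane", 3),
                           rep("Biological", 2), "Fire"))

p <- toy %>%
  count(Type) %>%
  mutate(rel = n / sum(n)) %>%
  ggplot(aes(x = "2002-2023", y = rel, fill = Type)) +
  geom_col() +                          # bars are stacked by default
  labs(x = NULL, y = "Relative frequency")
```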

Distribution of disaster types over time

Various insights can be gained from the time plot below. For example, severe storms occur quite frequently in general, except for roughly the years 2013-2015 and 2017-2018. Furthermore, only one volcanic eruption and one severe ice storm were recorded in the displayed time period, and no dam or levee breaks or biological disasters occurred between 2000 and 2023. We could hypothesize that fires lasted longer before about 2013 than afterwards, perhaps because more efficient ways to extinguish them were developed. However, the plot shows the durations of almost 500 disasters, which may overlap. Since using 500 different colors for the individual disasters would not help much, we would switch to a different way of showing this information, for example an interactive graphic built with Shiny.

Duration per disaster type

Exploring the disaster types a little further, the boxplot below shows that fires have the largest variation in duration. The disaster type with the largest median duration is volcanic eruptions. However, this observation should be treated with caution, since there appears to have been only one such incident in the analysed time frame (only the median is shown with no visible variation, which suggests a single data point in this group). Outliers with very long durations can be detected among floods, hurricanes, severe storms, and tornadoes.
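A sketch of such a boxplot, with toy durations in place of the project's Duration column:

```r
library(ggplot2)

# Toy durations (days) per disaster type, standing in for the real data;
# the single "Volcanic Eruption" row mimics the one-observation group
toy <- data.frame(
  Type = rep(c("Fire", "Flood", "Tornado", "Volcanic Eruption"),
             times = c(4, 4, 4, 1)),
  Duration = c(10, 45, 200, 30,  5, 12, 20, 90,  1, 1, 2, 14,  60)
)

p <- ggplot(toy, aes(x = Type, y = Duration)) +
  geom_boxplot() +   # a one-observation group shows only its median line
  labs(y = "Duration (days)")
```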

Damage per state

To analyse the damage data, we first took a look at the distribution of the data points. The histogram indicated a right-skewed distribution.

Also, the differences between the individual damage amounts are quite large, which makes the data hard to work with. We therefore took the logarithm of the data, which resulted in an approximately normal distribution:
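The transformation step can be sketched as follows (simulated right-skewed amounts stand in for the damage column):

```r
library(ggplot2)

set.seed(1)
damage <- rlnorm(500, meanlog = 10, sdlog = 1.5)  # simulated right-skewed amounts

p.raw <- ggplot(data.frame(damage), aes(x = damage)) +
  geom_histogram(bins = 30)        # heavily right-skewed
p.log <- ggplot(data.frame(damage), aes(x = log(damage))) +
  geom_histogram(bins = 30)        # roughly bell-shaped after the log transform
```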

With the transformed data, we could generate a plot showing the damage disasters have caused per state in the US over the past 20 years:


Population per state

The following map shows the population per state. Similar to above, we are using the log-transformed data.
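One way such a choropleth could be built (a sketch, assuming the `maps` package is available; the random log populations are placeholders for the real values):

```r
library(ggplot2)

states <- map_data("state")          # requires the 'maps' package
set.seed(1)
regions <- unique(states$region)
pop.by.state <- data.frame(region = regions,
                           LogPop = log(runif(length(regions), 5e5, 4e7)))

p <- ggplot(pop.by.state, aes(map_id = region)) +
  geom_map(aes(fill = LogPop), map = states, color = "white") +
  expand_limits(x = states$long, y = states$lat) +
  coord_quickmap() +
  labs(fill = "log(Population)")
```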

Actual Damage vs. approved amount

How much do the estimated damage and the approved amount differ? As before, we worked with log-transformed data. The overlapping histograms show that there are a few instances where the approved amount is a lot lower than the damage that actually occurred. Vice versa, there are instances where the approved amount exceeded the actual damage.
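Overlapping histograms of two log-transformed amounts can be drawn roughly like this (simulated values in place of the damage and approved columns):

```r
library(ggplot2)
library(tidyr)

set.seed(1)
toy <- data.frame(DamageLog = rnorm(200, mean = 10),
                  ApprovedLog = rnorm(200, mean = 9.5))

p <- toy |>
  pivot_longer(everything(), names_to = "Amount", values_to = "log.value") |>
  ggplot(aes(x = log.value, fill = Amount)) +
  # position = "identity" overlays the bars instead of stacking them
  geom_histogram(alpha = 0.5, position = "identity", bins = 30)
```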

Approved and actually used (repair)

When comparing the amount of money that was approved to repair the damage with the actual costs spent on reconstructions, we can see that in almost all cases, less money was spent than approved.

Diving into this in a bit more detail, we can distinguish between disaster types. For some, like tornadoes or severe ice storms, the approved amounts exceed the spent amounts far more than for others, like earthquakes or floods, where the two amounts are much closer.

boxplot.repair.df <- data.frame(ApprovedLog = log(df.prep.5$Approved),
                               RepairLog = log(df.prep.5$Repair),
                      Type = df.prep.5$Type)

boxplot.repair.num.df <- boxplot.repair.df %>% 
  filter_if(~is.numeric(.), all_vars(!is.infinite(.)))

b.rep.inf <- boxplot.repair.num.df %>%
  pivot_longer(ApprovedLog:RepairLog, names_to = "Names", values_to = "values") %>%
  ggplot(aes(y = values, x = Type, fill = Names))+
  geom_boxplot() + 
  facet_wrap(~Type, scale="free")+
  ggtitle('Approved and Repair Amount log-transformed without Inf Rows')
b.rep.inf + theme(
  plot.title = element_text(size=12, face="bold.italic"),
  axis.text.x=element_blank())

b.rep <- boxplot.repair.df %>%
  pivot_longer(ApprovedLog:RepairLog, names_to = "Names", values_to = "values") %>%
  ggplot(aes(y = values, x = Type, fill = Names))+
  geom_boxplot() +
  facet_wrap(~Type, scale="free")+
  ggtitle('Approved and Repair Amount log-transformed with Inf Rows')
b.rep+
  theme(
  plot.title = element_text(color = 'red',size=12, face="bold.italic"),
  axis.text.x=element_blank())

# ggarrange(b.rep.inf, b.rep, 
#           labels = c("With Infinite Numbers", "Without Infinite Number"),
#           ncol = 1, nrow = 2)

Shiny Dashboard for EDA

The Shiny dashboard provides an interactive way to further inspect aspects of the natural disasters and where they occur: https://milicapajkic.shinyapps.io/Dashboard/

Duration of Disaster Type by State and County

Heatmap

Model

The goal of this model-fitting chapter is to see whether linear regression is an appropriate tool to predict/explain the damage amount per state.

The sum of the damage across each state shows a slight right skew. We therefore performed a log transformation, after which the distribution is closer to normal.
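The aggregation step can be sketched as follows (toy records standing in for df.prep.5):

```r
library(dplyr)

# Toy per-disaster records; State and Damage mimic columns of df.prep.5
toy <- data.frame(State = c("Alabama", "Alabama", "Montana", "Texas"),
                  Damage = c(1e6, 5e4, 2e3, 8e7))

state.damage <- toy %>%
  group_by(State) %>%
  summarise(TotalDamage = sum(Damage)) %>%
  mutate(LogTotal = log(TotalDamage))   # reduces the right skew
```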

Model 2

In this chapter we want to see whether the amount of damage can be predicted with the given data points.

H1: With a higher income per capita, the damage will be lower.

The reasoning is that with a higher income per capita, preparation and precaution for natural disasters are better. For one, there is no monetary strain like in counties with a lower income per capita.

H2: The damage will differ depending on which type of natural catastrophe occurs.

First, we need to look at the distribution of the important variables. It appears that damage and income per capita need to be log-transformed, so we applied the logarithm to both.

The scatter plot indicates that there is a positive relationship between income per capita and damage.

To see whether the chosen variables have an influence on damage, we performed a test (drop1). The full model indicates that only log.income, Type, and Duration have relevant effects on the dependent variable damage.

ggplot(data = df.prep.5, 
       mapping = aes(x = IncomeCapita, fill = '')) +
  geom_histogram(show.legend = FALSE)+
  labs(x = 'Income', y = 'Count') +
  scale_fill_manual(values = c('lightblue2'))+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=18,face="bold"))

ggplot(data = df.prep.5, 
       mapping = aes(x = log(IncomeCapita), fill = '')) +
  geom_histogram(show.legend = FALSE)+
  labs(x = 'log. Income', y = 'Count') +
  scale_fill_manual(values = c('mistyrose3'))+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=18,face="bold"))

plot(df.prep.5$Damage, df.prep.5$IncomeCapita)

plot(log(df.prep.5$Damage), log(df.prep.5$IncomeCapita))

ggplot(df.prep.5, aes(log(IncomeCapita),log(Damage))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Linear Regression

df.model <- df.prep.5
df.model$log.damage <- log(df.model$Damage)
df.model$log.income <- log(df.model$IncomeCapita)
# 
df.model <- df.model[!is.na(df.model$log.damage) & is.finite(df.model$log.damage), ]
# 
model <- lm(log.damage ~ log.income, data = na.omit(df.model))
summary(model)
## 
## Call:
## lm(formula = log.damage ~ log.income, data = na.omit(df.model))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2833 -1.7712 -0.1211  1.6129  8.7048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -3.5667     5.2192  -0.683   0.4949  
## log.income    1.2822     0.5184   2.473   0.0139 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.421 on 309 degrees of freedom
## Multiple R-squared:  0.01941,    Adjusted R-squared:  0.01624 
## F-statistic: 6.117 on 1 and 309 DF,  p-value: 0.01392

Model with all the variables

In this model we fit all the variables and afterwards apply drop1 to see which variables have a significant influence on the dependent variable. We then drop the variables with no significant influence on damage. The final model includes income, type of disaster, and duration.

model.full <- lm(log.damage ~ log.income + State+ Type+ Population+ Duration, data = na.omit(df.model))
summary(model.full)
## 
## Call:
## lm(formula = log.damage ~ log.income + State + Type + Population + 
##     Duration, data = na.omit(df.model))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.364 -1.379  0.000  1.358  6.059 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)           -4.882e+00  7.277e+00  -0.671   0.5029  
## log.income             1.451e+00  7.046e-01   2.060   0.0405 *
## StateAlaska           -1.311e+00  1.723e+00  -0.761   0.4473  
## StateArkansas          9.686e-01  9.610e-01   1.008   0.3145  
## StateCalifornia        8.646e-01  1.401e+00   0.617   0.5377  
## StateColorado         -1.704e+00  1.760e+00  -0.969   0.3337  
## StateConnecticut      -4.361e-01  1.397e+00  -0.312   0.7551  
## StateDelaware         -1.544e+00  2.424e+00  -0.637   0.5246  
## StateFlorida           1.421e-01  8.456e-01   0.168   0.8667  
## StateGeorgia          -8.184e-01  9.437e-01  -0.867   0.3867  
## StateHawaii           -4.165e-01  1.550e+00  -0.269   0.7884  
## StateIllinois          3.577e-01  9.325e-01   0.384   0.7016  
## StateIndiana          -3.766e-01  9.327e-01  -0.404   0.6867  
## StateIowa             -4.459e-01  1.331e+00  -0.335   0.7379  
## StateKansas            4.305e-02  1.475e+00   0.029   0.9767  
## StateKentucky         -9.493e-02  8.535e-01  -0.111   0.9115  
## StateLouisiana         1.670e-01  8.970e-01   0.186   0.8524  
## StateMaine            -1.158e+00  1.760e+00  -0.658   0.5111  
## StateMaryland          1.438e+00  1.782e+00   0.807   0.4205  
## StateMassachusetts    -9.197e-01  1.255e+00  -0.733   0.4645  
## StateMichigan          4.720e-01  1.504e+00   0.314   0.7539  
## StateMinnesota        -1.209e+00  1.336e+00  -0.905   0.3662  
## StateMississippi       9.258e-01  7.989e-01   1.159   0.2476  
## StateMissouri         -3.151e-01  9.981e-01  -0.316   0.7525  
## StateMontana          -1.686e+00  1.797e+00  -0.938   0.3492  
## StateNebraska         -2.212e+00  1.487e+00  -1.487   0.1382  
## StateNevada            5.371e+00  2.382e+00   2.254   0.0250 *
## StateNew Hampshire    -1.793e+00  1.353e+00  -1.325   0.1864  
## StateNew Jersey       -8.625e-01  1.095e+00  -0.788   0.4317  
## StateNew Mexico        5.113e+00  2.382e+00   2.147   0.0328 *
## StateNew York         -1.380e+00  9.197e-01  -1.501   0.1346  
## StateNorth Carolina   -2.630e-01  1.084e+00  -0.243   0.8086  
## StateNorth Dakota     -7.434e-01  1.839e+00  -0.404   0.6863  
## StateOhio             -4.104e-02  1.001e+00  -0.041   0.9673  
## StateOklahoma         -1.962e-01  9.490e-01  -0.207   0.8364  
## StateOregon           -5.421e-02  1.755e+00  -0.031   0.9754  
## StatePennsylvania     -7.499e-03  1.166e+00  -0.006   0.9949  
## StateRhode Island      2.189e-01  2.450e+00   0.089   0.9289  
## StateSouth Carolina    1.276e+00  1.321e+00   0.965   0.3353  
## StateSouth Dakota     -1.277e+00  1.369e+00  -0.933   0.3518  
## StateTennessee        -7.755e-01  9.151e-01  -0.847   0.3975  
## StateTexas            -1.098e-01  8.562e-01  -0.128   0.8980  
## StateVermont          -1.653e+00  2.395e+00  -0.690   0.4907  
## StateVirginia         -6.829e-01  1.150e+00  -0.594   0.5532  
## StateWashington        8.993e-02  1.753e+00   0.051   0.9591  
## StateWest Virginia    -7.573e-01  8.890e-01  -0.852   0.3951  
## StateWisconsin         4.020e-01  1.478e+00   0.272   0.7859  
## StateWyoming          -2.834e+00  2.434e+00  -1.165   0.2452  
## TypeEarthquake         1.446e+00  2.973e+00   0.486   0.6272  
## TypeFire               1.291e+00  2.933e+00   0.440   0.6603  
## TypeFlood             -2.039e-01  2.692e+00  -0.076   0.9397  
## TypeHurricane         -4.446e-01  2.707e+00  -0.164   0.8697  
## TypeOther              3.339e+00  3.645e+00   0.916   0.3606  
## TypeSevere Ice Storm  -9.198e-01  3.163e+00  -0.291   0.7715  
## TypeSevere Storm      -5.954e-01  2.674e+00  -0.223   0.8240  
## TypeTornado           -2.885e+00  2.785e+00  -1.036   0.3014  
## TypeVolcanic Eruption  7.260e+00  3.827e+00   1.897   0.0590 .
## Population             1.587e-07  2.442e-07   0.650   0.5165  
## Duration               1.327e-02  6.997e-03   1.897   0.0590 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.302 on 252 degrees of freedom
## Multiple R-squared:  0.2769, Adjusted R-squared:  0.1105 
## F-statistic: 1.664 on 58 and 252 DF,  p-value: 0.004167
drop1(model.full, test = 'F')
model.right <- lm(log.damage ~ log.income +  Type+ Duration, data = na.omit(df.model))
summary(model.right)
## 
## Call:
## lm(formula = log.damage ~ log.income + Type + Duration, data = na.omit(df.model))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.960 -1.570 -0.078  1.551  5.864 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            1.750439   5.472425   0.320   0.7493  
## log.income             0.837698   0.497469   1.684   0.0932 .
## TypeEarthquake         1.007025   2.474533   0.407   0.6843  
## TypeFire               1.504898   2.482205   0.606   0.5448  
## TypeFlood             -0.929765   2.316326  -0.401   0.6884  
## TypeHurricane         -0.813160   2.297781  -0.354   0.7237  
## TypeOther              3.478853   3.221237   1.080   0.2810  
## TypeSevere Ice Storm  -1.994636   2.789519  -0.715   0.4751  
## TypeSevere Storm      -1.207145   2.284732  -0.528   0.5976  
## TypeTornado           -3.328925   2.388800  -1.394   0.1645  
## TypeVolcanic Eruption  6.932481   3.280384   2.113   0.0354 *
## Duration               0.009687   0.006022   1.609   0.1088  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.278 on 299 degrees of freedom
## Multiple R-squared:  0.1602, Adjusted R-squared:  0.1293 
## F-statistic: 5.184 on 11 and 299 DF,  p-value: 1.822e-07

Fitted Values

fitted.model <- fitted(model)
str(fitted.model)
##  Named num [1:311] 9.16 9.4 9.34 8.81 9.94 ...
##  - attr(*, "names")= chr [1:311] "1" "2" "3" "4" ...
plot(log.damage ~ log.income, data = df.model,
     main = "Model 'model'",
     col = "pink4")

# points(fitted.model ~ log.income,
#        col = "lightblue4",
#        pch = 19,
#        data = df.model)
# abline(model, col = "red4")
df.sub <- df.model[which(!is.na(fitted.model)), ]
# Create plot with both observed and fitted values
plot(log.damage ~ log.income, data = df.sub,
     main = "Model 'model'",
     col = "pink4")
points(fitted.model ~ log.income,
       col = "lightblue4",
       pch = 19,
       data = df.sub)
abline(model, col = "red4")

# df.plot <- data.frame(log.damage = df.model$log.damage, log.income = df.model$log.income, fitted = fitted.model)
# 
# # Plot the observed and fitted values
# plot(log.damage ~ log.income, data = df.plot, main = "Model 'model'", col = "pink4")
# points(fitted ~ log.income, data = df.plot, col = "lightblue4", pch = 19)
# abline(model, col = "red4")

Chapter of choice

For the chapter of choice we chose the Shiny dashboard application. The focus lies on the exploratory analysis and on making it interactive for users.

For every analysis, some kind of outlier detection is a way to see whether the data has structure. Here, this is done via clustering: multiple variables are given, and groups are formed from them. The goal is to have large differences between the formed groups, while the differences within a group should be small.

#Dataframe
df.outlier.0 <- data.frame(Approved = df.prep.5$Approved, 
                              Repair = df.prep.5$Repair,
                           income = df.prep.5$IncomeCapita,
                           population = df.prep.5$Population,
                           damage = df.prep.5$Damage)

df.outlier.1 <- data.frame(repair = df.outlier.0$Repair,
                           income = df.outlier.0$income)
plot(df.outlier.1)

Clustering the two variables income and repair

When we scale the data, we get one big cluster. Examining the elbow plot after the initial clustering, the elbow lies at 2 or 5 clusters. Further analysis with cluster plots and a silhouette plot indicates that two clusters are the better solution. This means the data is dominated by one large group with only a few outlying points.

#prep the data
datas <- scale(df.outlier.1, center = FALSE, scale = TRUE)
datas <- na.omit(datas)
boxplot(datas)

plot(datas)

dists <- dist(datas)
km <- kmeans(datas, centers = 3, nstart = 10)
groups_km <- km$cluster
groups_km
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [186] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [223] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [297] 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [334] 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [408] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [445] 3
cluster_size <- cbind(sum(groups_km == 1), sum(groups_km == 2), 
                      sum(groups_km == 3))
cluster_size
##      [,1] [,2] [,3]
## [1,]    1    4  440
plot(datas, pch = groups_km, col=groups_km, lwd=2)
legend("topright", legend = 1:3, pch = 1:3, col=1:3, bty="n")

reps <- rep(0, 6)
for (i in 1:6) reps[i] <- sum(kmeans(datas, centers = i, nstart = 20)$withinss)
par(mfrow = c(1,1))
plot(1:6, reps, type = "b", xlab = "Number of groups", ylab = "Sum of squares")
text(3, 120, "Elbow point signifies how many \ngroups there could be", col = "red", cex = 1.08, pos = 4)

km2 <- kmeans(datas, centers = 2, nstart = 10)
groups_km2 <- km2$cluster
groups_km2
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [334] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [445] 2
cluster_size2 <- cbind(sum(groups_km2 == 1), sum(groups_km2 == 2))
cluster_size2
##      [,1] [,2]
## [1,]    2  443
plot(datas, pch = groups_km2, col=groups_km2, lwd=2)
legend("topright", legend = 1:2, pch = 1:2, col=1:2, bty="n")

km5 <- kmeans(datas, centers = 5, nstart = 10)
groups_km5 <- km5$cluster
groups_km5
##   [1] 2 2 2 2 2 5 3 5 2 2 2 2 5 5 2 5 2 5 2 2 2 2 2 2 2 2 2 2 5 5 5 2 5 2 2 2 2
##  [38] 2 2 5 5 2 2 2 5 5 2 2 2 2 5 5 2 2 5 2 2 2 5 3 5 5 5 2 2 2 5 2 2 5 5 2 5 2
##  [75] 5 2 2 5 2 2 2 2 2 5 5 2 5 2 5 2 2 5 2 2 2 2 5 2 2 5 5 5 2 2 2 5 2 5 5 5 5
## [112] 2 2 2 5 2 5 5 5 2 2 5 5 5 5 2 2 5 5 2 2 5 2 2 2 5 2 2 2 5 5 5 5 2 5 5 2 2
## [149] 5 2 2 5 2 2 5 2 2 2 2 2 2 2 5 5 5 2 2 5 2 5 2 2 5 2 2 2 2 2 2 5 2 2 5 5 5
## [186] 5 2 2 2 3 5 2 2 2 2 2 2 2 2 5 2 2 2 2 2 2 2 2 2 2 5 2 2 2 2 5 5 2 2 5 5 2
## [223] 2 2 2 2 2 5 2 5 5 1 5 5 5 5 2 5 2 2 2 5 2 2 2 2 5 2 2 2 2 2 5 2 5 2 2 2 5
## [260] 2 2 5 5 2 5 5 5 2 5 2 5 5 2 2 2 4 5 2 2 5 2 2 2 2 2 2 2 2 5 2 2 2 2 2 2 2
## [297] 3 2 2 2 2 2 2 5 2 2 2 2 2 2 2 5 2 2 2 5 2 2 2 2 2 5 2 2 2 2 2 2 2 2 2 2 2
## [334] 2 2 2 2 5 3 5 2 2 2 2 2 5 2 2 5 5 5 2 2 2 5 5 5 2 2 5 5 5 2 5 5 3 2 2 2 2
## [371] 5 2 2 2 5 5 5 5 5 5 2 5 5 5 2 5 2 2 5 5 5 5 2 5 2 2 5 5 2 2 2 2 2 2 2 2 2
## [408] 5 2 2 5 2 2 5 2 5 5 2 2 5 5 2 2 2 2 2 2 2 2 2 2 2 2 5 2 2 2 2 5 2 2 2 2 2
## [445] 2
cluster_size5 <- cbind(sum(groups_km5 == 1), sum(groups_km5 == 2), 
                      sum(groups_km5 == 3), sum(groups_km5 == 4),sum(groups_km5 == 5))
cluster_size5
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1  288    6    1  149
plot(datas, pch = groups_km5, col=groups_km5, lwd=2)
legend("topright", legend = 1:5, pch = 1:5, col=1:5, bty="n")

plot(silhouette(groups_km5, dists))

Chapter of choice (Martina)

To make up for the time I missed during the block week on Monday afternoon, I was given the task of producing an additional chapter of choice using an R package that hadn't been used before. Since we have date values in our data set, I thought it would be interesting to try to show a forecast. I started by generating a new data frame containing the frequency with which severe storms happened per month between 2002 and 2023. Since (luckily!) such a disaster did not happen every month, I complemented the data frame by adding every month in the given time period and filling in 0 for months without an event. To generate the time series plot, I loaded the packages "forecast" and "zoo"; the zoo package was needed because, for some reason, R otherwise converted the date column into a different index before using it for the x-axis. With zoo, the original date column can be specified as the index for the x-axis.

library(forecast)
library(zoo)
Sstorms <- subset(df.prep.5, Type == "Severe Storm")
Sstorms_timeonly <- Sstorms[,c("Disaster#", "DisasterBegin")]

#calculating the frequency of severe storms per month
Sstorms_month_year <- format(Sstorms_timeonly$DisasterBegin, "%Y-%m")
freq_table <- table(Sstorms_month_year)

#generating the data frame
frequency <- data.frame(Sstorms_month_year = names(freq_table), frequency = as.numeric(freq_table))

#converting the date format to a format R can recognize for the time series plot
frequency$Sstorms_month_year <- as.Date(paste0(frequency$Sstorms_month_year, "-01"), format = "%Y-%m-%d")

##filling in all months without a severe storm happening and adding 0 values 
# Create a sequence of dates covering the entire range
start_date <- min(frequency$Sstorms_month_year)
end_date <- max(frequency$Sstorms_month_year)
all_dates <- seq(start_date, end_date, by = "month")

# Create a new data frame with missing months filled with 0 values
dates_df <- data.frame(date = all_dates)

colnames(frequency) <- c("date", "frequency")

# Merge the new data frame with the original data frame
dates_merged <- merge(frequency, dates_df, by = "date", all = TRUE)

# Replace missing values with 0
dates_merged[is.na(dates_merged$frequency), "frequency"] <- 0

#somehow, R did not use the date column for the x-axis but converted it first into other indexes. 
#Therefore, used the zoo package to create the time series object and specify
#date column as index

# zoo keeps the original dates as the index of the time series object
frequency_tszoo <- zoo(dates_merged$frequency, order.by = dates_merged$date)

# Plot the time series with the original dates as the x-axis
plot(frequency_tszoo, type = "l", xlab = "Date", ylab = "Frequency", main = "Frequency of severe storms / month")

The time series plot indicates a trend towards a decreasing frequency of severe storms in later years, but no seasonal component can be clearly identified. Trying to decompose the data anyway was therefore not successful and resulted in the error message that the time series is not periodic or has fewer than two periods.

#stl <- stl(frequency_tszoo, na.action = "omit", s.window = 12)

In another attempt, I looked at the autocorrelation and partial autocorrelation functions to get an indication of whether it could make sense to fit a linear model, such as an auto-regressive (AR) model, to the data for the forecast. The acf resulted in the following plot, whereas the pacf resulted in an error message that the maximal lag must be at least one.

acf(frequency_tszoo, na.action = na.pass)

#library(ggfortify)
#ggPacf(frequency_tszoo)

Looking at the time series plot, I would also argue that an accurate forecast is quite hard, since the frequency dropped considerably in the past few years. We would therefore need additional data to confirm that this is a long-lasting drop before basing a forecast on it.

PS: Just to try it out, I fitted an AR(1) model on the data as a whole and predicted the next 100 data points, which resulted in a nonsensical horizontal line:

f_fit <- arima(frequency_tszoo, order = c(1,0,0))
forecast <- predict(f_fit, n.ahead = 100)
plot(frequency_tszoo, 
     xlim = c(18000, 19737),
     xlab = "Time",
     ylab = "Frequency")
lines(forecast$pred, lwd = 2, col = "red")

Summary & Conclusion

In this project, we focused on exploratory data analysis and different ways of displaying data. As is often the case for amounts, count data, etc., variables like amounts of money tended to be right-skewed, which is why we mostly worked with log-transformed data. The log transformation helps, for example, to generate figures that are easier to interpret when it comes to detecting differences. However, the caveat is that the "real numbers" cannot be read directly from the graph. Furthermore, the log transformation shifts right-skewed data towards a more Gaussian distribution, which is important when the goal is to use statistical analysis methods or machine learning models, since many of these methods assume an underlying Gaussian distribution of the data. For generating plots, we quickly came to like the ggplot2 package. The explanation in the block seminar that ggplot uses a layered approach was very helpful and made using it and playing around with it much easier and more fun. The session on R Markdown was also incredibly helpful, and overall we are taking away a lot of great tips and tricks that will surely make our daily coding life more enjoyable :-)